ABSTRACT

We present an analysis of anomaly detection for machine learning redshift estimation.
Anomaly detection allows the removal of poor training examples, which can adversely
influence redshift estimates. Anomalous training examples may be photometric galaxies
with incorrect spectroscopic redshifts, or galaxies with one or more poorly measured
photometric quantities.

We select 2.5 million ‘clean’ SDSS DR12 galaxies with reliable spectroscopic redshifts,
and 6730 ‘anomalous’ galaxies with spectroscopic redshift measurements which
are flagged as unreliable. We contaminate the clean base galaxy sample with galaxies
with unreliable redshifts and attempt to recover the contaminating galaxies using the
Elliptical Envelope technique. We then train four machine learning architectures for
redshift analysis on both the contaminated sample and on the preprocessed
‘anomaly-removed’ sample and measure redshift statistics on a clean validation sample
generated without any preprocessing. We find an improvement on all measured statistics
of up to 80% when training on the anomaly-removed sample as compared with training on
the contaminated sample for each of the machine learning routines explored. We further
describe a method to estimate the contamination fraction of a base data sample.

Key words: galaxies: distances and redshifts, catalogues, surveys. 


1 INTRODUCTION 

Photometric surveys can be maximally exploited for large
scale structure analyses once galaxies have been identified
and their positions on the sky and in redshift space have
been measured. Measuring accurate spectroscopic redshifts
is costly and time intensive, and is typically only performed
for a small subsample of all galaxies. For this subsample of
galaxies one may learn a mapping between the measured
photometric properties and the spectroscopic redshift. This
mapping can then be applied to all photometrically identified
galaxies to estimate redshifts. This is the basis of machine
learning redshift estimation, and it inherently assumes that
the galaxies used to construct the mapping form an unbiased
and uncontaminated sample of the final dataset.

Recent work by the current authors shows that if the
base training sample is biased compared to the final sample,
it may be augmented, e.g., by adding galaxies from simulations,
to make the data sets appear more similar (Hoyle et al. 2015).
The data augmentation process has been shown to improve the
redshift estimate of the final test sample. In this paper we
examine the problem of identifying poorly measured galaxy
properties which contaminate the base training set. The
contamination may be due to incorrectly measured spectroscopic
redshifts, or unreliable photometric properties.


Photometric redshifts are also estimated by parametric
techniques, for example from galaxy Spectral Energy Distribution
(hereafter SED) templates. Some templates encode our knowledge
of stellar population models which result in predictions for the
evolution of galaxy magnitudes and colors. This parametric
encoding of the complex stellar physics coupled with the
uncertainty of the parameters of the stellar population models
combine to produce redshift estimates which are little better
than many non-parametric techniques (see e.g., Hildebrandt et al.
2010; Dahlen 2013 for an overview of different techniques).

When a representative training sample is available, machine
learning methods offer an alternative to template methods to
estimate galaxy redshifts. The ‘machine architecture’ determines
how to best manipulate the photometric galaxy input properties
(or ‘features’) to produce a machine learning redshift. The
machine attempts to learn the most effective manipulations to
minimize the difference between the spectroscopic redshift and
the machine learning redshift of the training sample.

The field of machine learning for photometric redshift
analysis has been developing since Tagliaferri et al. (2003)
used artificial Neural Networks (aNNs). A plethora of machine
learning architectures, including tree based methods, have been
applied to the problem of point prediction redshift estimation
(see e.g. Sanchez et al. 2014 for a further list and routine
comparisons), or to estimate the full redshift probability
distribution function (hereafter pdf; Gerdes et al. 2010;
Carrasco Kind & Brunner 2013; Bonnett 2015; Rau et al. 2015).
Machine learning architectures have also had success in other
fields of astronomy such as galaxy morphology identification,
and star & quasar separation (see for example Lahav 1997;
Yeche et al. 2009).

It is often assumed that the training data does not
contain galaxies with unreliable spectroscopic redshift
estimates, or does not contain galaxies with incorrectly
estimated photometric properties. However, the contamination
of a training sample can adversely affect the recovered machine
learning redshifts. Cunha et al. (2014) use simulated spectra
to show how the cosmological constraints for a weak lensing
survey are degraded in the presence of even 1% of spectroscopic
outliers in the training sample.

Previous work on outlier analysis has been confined to
examining the properties of the machine learning redshifts
after the system has been trained. For example, photometric
redshift ‘outliers’ which actually sit in a different redshift
bin than expected can be identified by cross correlating
data across bins (see e.g., Schneider et al. 2006; Bernstein
& Huterer 2010; McQuinn & White 2013). Training data
can also be carefully removed if the final machine learning
redshift and the spectroscopic redshift are found to be very
dissimilar (Cunha et al. 2014). More recently outlier detection
has been performed on galaxies after a pdf has been
obtained, by examining the pdf for multiple peaks or other
irregularities (Carrasco Kind & Brunner 2014a). All of these
methods enable the construction of a cleaner final sample of
galaxies. However, the cleaned sample must first be carefully
checked to ensure that a sample bias has not been introduced
before being used for scientific analysis. In particular
the final test sample must be made to be representative of
the cleaned training sample.

In this paper we explore the effect of performing outlier
analysis, or anomaly detection, on the training sample
to identify discrepant photometric data, or unreliable
spectroscopic redshifts, before the sample is used to estimate a
machine learning redshift. We then show how the removal of
this anomalous data improves the machine learning redshift
metrics for two very different groups of machine learning
architectures.

This paper is organized as follows. In §2 we describe the
data sample and the machine learning methods employed. We
present the anomaly analysis and the improvement to the
redshift estimates using the anomaly detection in §3, and
conclude in §4.


2 DATA AND MACHINE ARCHITECTURES 

In this study we use observational data drawn from the final
SDSS data release, and explore a selection of machine
learning architectures for anomaly detection and machine
learning redshift estimation.


2.1 Observational dataset 

The observational data in this study are drawn from SDSS
III Data Release 12 (Alam et al. 2015). The SDSS I-III
uses a 2.5 meter telescope at Apache Point Observatory in
New Mexico and has CCD wide field photometry in 5 bands
(u, g, r, i, z; Gunn et al. 2006; Smith et al. 2002), and an
expansive spectroscopic follow up program (Eisenstein 2011)
covering π steradians of the northern sky. The SDSS
collaboration has obtained more than 3 million galaxy spectra
using dual fiber-fed spectrographs. An automated photometric
pipeline performs object classification to a magnitude of
r ≈ 22 and measures photometric properties of
more than 100 million galaxies. The complete data sample,
and many derived catalogs such as the photometric properties,
are publicly available through the CasJobs server (Li &
Thakar 2008).


The SDSS is well suited to the analyses presented in
this paper due to the enormous number of photometrically
selected galaxies with spectroscopic redshifts to use as
training and test samples. In particular, if a galaxy spectrum is
obtained and subsequently found to be erroneous by the
processing pipeline, the flag ‘zWarning’ is set to be larger
than 0. For the SDSS dataset the quality flag zWarning is
a good estimator of the reliability of the spectroscopic
redshift. This is not always true for other spectroscopic surveys
or datasets, e.g. PRIMUS (e.g. see Bonnett et al. in prep.;
Coil et al. 2011; Cool et al. 2013). Furthermore, the SDSS
galaxies which have unreliably measured redshifts are often
followed up at a later date and new spectra are obtained.
Many of these new spectra do not incur warnings during
processing. It is exactly these cases which are utilized
in this paper. Firstly we identify galaxies with at least one
poorly measured spectrum and one well measured spectrum.
Then we extract all occurrences of these galaxies from the
base sample. We then assign the unreliably measured
spectroscopic redshift to the galaxy and contaminate the
clean base sample with this galaxy. Finally we use machine
learning to try to separate the galaxies which have unreliable
spectroscopic redshifts from those with reliable redshifts.

skyserver.sdss3.org/CasJobs

Figure 1. Top panel: The distribution of absolute magnitude
against redshift estimated for the galaxies with both an
unreliable and a reliable redshift. The reliable and unreliable
data are shown by the circle and starred data points respectively.
Bottom panel: The redshift distribution of the full galaxy sample
is shown by the solid grey line. The dashed orange line shows the
reliable redshift distribution of those galaxies which have both a
reliable and an unreliable redshift. We will remove these galaxies
from the base sample. The joined dotted blue line shows the
distribution of unreliable redshifts for those galaxies which we
have just removed. We use these galaxies with their unreliable
redshift estimates to contaminate the base sample.

We select all objects from CasJobs with both spectroscopic
redshifts and photometric properties which are classified
as galaxies by the photometric pipeline. This sample
will also include some contamination from stars and quasars.
In detail we run the MySQL query shown in the appendix.
The query extracts 2.5M galaxies with a range of photometric
and spectroscopic qualities. The data selection is very
relaxed in terms of allowed measured errors in both photometric
and spectroscopic properties. In §3.4.2 we perform
a similar analysis to that which follows but impose a more
stringent set of selection criteria. This query also obtains
galaxies with multiple spectra measurements and allows us
to identify 76639 unique galaxies with ‘zWarning’ > 0. Of
these galaxies, 9115 have both a poorly measured
spectroscopic redshift above 0, and a well measured
spectroscopic redshift with an error less than 0.001. We next
select galaxies for which the difference between the poorly
measured and well measured redshifts is greater than 0.01,
resulting in 6734 galaxies of which 3502 are unique. We impose
this selection because we do not expect the error on the machine
learning redshift estimate to be below 0.01.
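
The selection cuts described above can be sketched as a boolean mask. This is a minimal illustration only, not the CasJobs query used in the paper: the function name, argument names, and the assumption that each galaxy has already been matched to one flagged and one unflagged redshift are all hypothetical.

```python
import numpy as np

def select_contaminants(z_unreliable, z_reliable, z_reliable_err,
                        min_dz=0.01, max_reliable_err=0.001):
    """Mask of galaxies usable as contaminants (hypothetical helper).

    Each index is one galaxy with one flagged (zWarning > 0) redshift and
    one unflagged redshift, assumed to be already matched to each other.
    """
    z_unreliable = np.asarray(z_unreliable, dtype=float)
    z_reliable = np.asarray(z_reliable, dtype=float)
    z_reliable_err = np.asarray(z_reliable_err, dtype=float)

    positive_bad_z = z_unreliable > 0                      # flagged redshift above 0
    precise_good_z = z_reliable_err < max_reliable_err     # reliable error below 0.001
    large_offset = np.abs(z_unreliable - z_reliable) > min_dz  # |dz| above 0.01
    return positive_bad_z & precise_good_z & large_offset
```

For example, a galaxy with a flagged redshift of 0.50 and a reliable redshift of 0.10 ± 0.0002 passes all three cuts, while one whose two redshifts differ by only 0.005 does not.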

We use the SDSS k-correct (Blanton & Roweis 2007)
package to estimate the absolute R band magnitude of the
6734 galaxies assuming both the reliable and unreliable
spectroscopic redshifts. We present the distribution of absolute
R magnitude against redshift in the top panel of Fig. 1
and mark the reliable and unreliable data by the circle and
starred data points respectively. The bottom panel of Fig. 1
shows the redshift distribution of the full galaxy sample by
the solid grey line. The dashed orange line shows the reliable
redshift distribution of those galaxies which have both
a reliable and an unreliable redshift. We will remove these
galaxies from the base sample. The joined dotted blue line
shows the distribution of unreliable redshifts for those galaxies
which we have just removed. We will use these galaxies
with unreliable redshifts later to contaminate the base sample.

The top panel of Fig. 1 shows that both the redshift
distribution and the absolute magnitude distribution of the
galaxies with reliable and unreliable redshift estimates are
very different. The unreliable spectroscopic redshift
distribution is peaked at higher redshifts. The distribution of
unreliable data is also peaked at brighter absolute magnitudes.
This is because the photometrically measured apparent
magnitudes are unchanged, and therefore the offset in absolute
magnitude is correlated with redshift. The bottom panel shows
that the sample of galaxies with both reliable and unreliable
redshifts is representative of the full base sample. As expected,
the unreliable redshift distribution appears to be very different
from the redshift distribution of the base sample.
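
The link between an overestimated redshift and a brighter inferred absolute magnitude can be made explicit with the standard distance modulus relation. This is a schematic sketch using conventional symbols that are not defined in the text above: m_r is the apparent r band magnitude, d_L(z) the luminosity distance at the assumed redshift, and K(z) the k-correction.

```latex
M_R = m_r - 5 \log_{10}\!\left( \frac{d_L(z)}{10\,\mathrm{pc}} \right) - K(z)
```

Since m_r is fixed by the photometry and d_L(z) increases with z, assigning a larger (unreliable) redshift to the same galaxy lowers M_R, i.e. the galaxy is inferred to be intrinsically brighter.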

In the analysis which follows we construct two training
samples. The first is drawn from the base sample of
data with reliable redshifts which has then been contaminated
with anomalous data that has unreliable redshift estimates.
The second training sample is the first sample with a
preprocessing step to remove anomalous data. We describe
the method to pre-process the data in §2.2. Finally we
construct a validation, or ‘test’, sample which is not used
during training. The validation sample is always drawn from
a non-overlapping set of base data which have reliable
redshifts. We describe the construction of these samples in more
detail in §3.1.

In this work we have concentrated on the following
eight features for outlier estimation: the spectroscopic
redshift and its error, the r band magnitude, the colors
g−i, g−r, r−i, z−r, and the Petrosian radius measured in the
r band. Of course we only use the photometric quantities
when estimating redshifts. Previous work has shown that
there are many other readily obtained photometric features
which also have strong predictive power when estimating
redshifts (Hoyle et al. 2015).

2.2 Anomaly identification 

We use the robust Elliptical Envelope routine from the
scikit-learn package (Pedregosa et al. 2011) as our base anomaly
detector. We briefly describe the routine below, and refer
the reader to Hubert & Debruyne (2010) for a review.

The Elliptical Envelope routine models the data as a
high dimensional Gaussian distribution with possible covariances
between feature dimensions. In short it attempts to
find a boundary ellipse that contains most of the data.
Any data outside of the ellipse is classified as anomalous.
The Elliptical Envelope routine uses the FAST-Minimum
Covariance Determinant (Rousseeuw & Driessen 1999) to
estimate the size and shape of the ellipse.

In detail, the FAST-Minimum Covariance Determinant
routine selects non overlapping subsamples of data and computes
the mean μ and covariance matrix C in each feature
dimension for each subsample. The Mahalanobis distance
d_mh is computed for each multidimensional data vector x
in each subsample and the data are ordered ascendingly by
d_mh. The Mahalanobis distance is defined by

$$ d_{mh} = \sqrt{(x - \mu)^{T} C^{-1} (x - \mu)} \qquad (1) $$

which reduces to the Euclidean distance if the covariance
matrix is the identity matrix, and the normalised Euclidean
distance if the covariance matrix is diagonal. To summarise,
the Mahalanobis distance measures how many ‘sigma’ a data
point is from the mean of a distribution.
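
As a concrete illustration of Eq. (1), the following sketch computes d_mh for a set of data vectors with NumPy. The function name and the choice to solve a linear system rather than invert C explicitly are illustrative assumptions, not part of the routine used in the paper.

```python
import numpy as np

def mahalanobis_distance(X, mean, cov):
    """Mahalanobis distance of each row of X from `mean`, given covariance `cov`."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    diff = X - np.asarray(mean, dtype=float)
    # Solve C y = (x - mu) instead of forming the explicit inverse of C.
    solved = np.linalg.solve(np.asarray(cov, dtype=float), diff.T).T
    # Row-wise (x - mu)^T C^{-1} (x - mu), then the square root.
    return np.sqrt(np.einsum("ij,ij->i", diff, solved))
```

With the identity matrix as covariance the result is the Euclidean distance, and with a diagonal covariance each coordinate offset is divided by its standard deviation before the distance is taken, as stated above.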

The FAST-Minimum Covariance Determinant method
continues by selecting subsamples from the original samples
with small values of d_mh. The mean, covariance, and the
values of d_mh of the subsamples are again computed. This
procedure is iterated until the determinant of the covariance
matrix converges. The covariance matrix with the smallest
determinant from all subsamples forms an ellipse which
encompasses a fraction of the original data. Data within the
ellipse surface are labeled as ‘inliers’, and data outside of
the ellipse are labeled as ‘outliers’ or anomalous, which may
then be removed.
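
The iteration described above can be sketched with a single starting subsample. This is a simplified illustration under stated assumptions: the function names, the subset size `h`, and the convergence tolerance are hypothetical, and the full FAST-MCD algorithm (and the scikit-learn implementation) additionally uses many starting subsamples and further refinements that are omitted here.

```python
import numpy as np

def c_step(X, subset_idx):
    """One concentration step: fit mean/covariance on the subset,
    then keep the same number of points with the smallest d_mh."""
    h = len(subset_idx)
    mean = X[subset_idx].mean(axis=0)
    cov = np.cov(X[subset_idx], rowvar=False)
    diff = X - mean
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    new_idx = np.argsort(d2)[:h]
    return new_idx, np.linalg.det(cov)

def concentrate(X, h, max_iter=50, tol=1e-12, seed=0):
    """Iterate c_step from one random subset until the determinant converges."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=h, replace=False)
    dets, prev_det = [], np.inf
    for _ in range(max_iter):
        idx, det = c_step(X, idx)
        dets.append(det)
        if prev_det - det <= tol * det:  # determinant has stopped decreasing
            break
        prev_det = det
    mean = X[idx].mean(axis=0)
    cov = np.cov(X[idx], rowvar=False)
    return mean, cov, idx, dets
```

Each step cannot increase the determinant of the subset covariance, so the recorded determinants form a non-increasing sequence and the subset contracts onto the bulk of the data.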

The hyper-parameter of the Elliptical Envelope routine
is the contamination rate n_c, which is the a priori assumed
fractional contamination rate of the data sample. We explore
this hyper-parameter in our subsequent analysis, but note
that this parameter does not need to be known to high accuracy
before using the routine. We further present a method
to estimate the contamination fraction from the data in §3.3.
The contamination rate hyper-parameter n_c describes
approximately how much of the data sample should sit outside
of the enclosing high dimensional ellipse that contains the
majority of the data.
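
A minimal sketch of how the contamination hyper-parameter enters in practice, using scikit-learn's `EllipticEnvelope` class. The helper name `flag_anomalies`, the default value of 0.1, and the fixed random seed are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

def flag_anomalies(features, contamination=0.1, seed=0):
    """Return a boolean mask that is True for rows labelled as outliers.

    `contamination` plays the role of n_c: the assumed fraction of the
    sample that lies outside the enclosing ellipse.
    """
    detector = EllipticEnvelope(contamination=contamination, random_state=seed)
    labels = detector.fit_predict(np.asarray(features, dtype=float))
    return labels == -1  # scikit-learn uses +1 for inliers and -1 for outliers
```

The mask can then be used to drop the flagged rows, e.g. `clean = features[~flag_anomalies(features, 0.05)]`, before any redshift estimator is trained.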


2.3 Tree based methods 


One of the machine learning architectures to estimate galaxy
redshifts used in this work is the scikit-learn implementation
of decision trees for regression (Breiman et al. 1984). The
tree based machine learning architecture recursively partitions
the input feature dimensions into an increasing number
of bins. Each bin is chosen to minimize the scatter of the
output feature, which for these purposes is the spectroscopic
redshift. This results in data with very similar spectroscopic
redshifts being within the same, or possibly nearby, bins.

The power of tree based methods is enhanced by combining
many trees. One technique to do this is called
Adaptive Boosting or Adaboost (Freund & Schapire 1997;
Drucker 1997), which adds trees sequentially to generate an
ensemble of trees. In the following we will refer to this
technique as simply ‘Adaboost’. Adaboost weighs each new tree
by its ability to predict redshifts correctly, and decides how
new trees are grown such that redshift estimates are improved
for the data with poorly estimated redshifts. For
more details about combining trees with Adaboost we refer
the reader to Hastie et al. (2001).

In this work we choose to fix the hyper-parameter set
for a single decision tree, the final number of trees, and
the method of growing trees. We choose the number of data
on each leaf node to be 10, and the number of trees to be
100. For Adaboost we select the linear loss function, but we
find that using the exponential loss function does not change
the results significantly. We choose the linear loss function
because the exponential loss function has previously been
shown to be sensitive to label noise in classification problems
(Dietterich 2000). We note that the best machine learning
hyper-parameters are normally tuned by using a cross validation
sample, and that tuning the hyper-parameters
of the model can have a large effect on the machine learning
redshift predictions.
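
The fixed configuration described above might be expressed in scikit-learn as follows. This is a sketch under assumptions: the paper states "the number of data on each leaf node to be 10", which is mapped here to `min_samples_leaf=10` (a minimum, not an exact count), and the helper name and random seed are hypothetical.

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

def make_adaboost_regressor(seed=0):
    """Boosted regression trees: 100 trees, linear loss, >= 10 data per leaf."""
    tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=seed)
    # The base tree is passed positionally because its keyword name
    # differs between scikit-learn releases.
    return AdaBoostRegressor(tree, n_estimators=100, loss="linear",
                             random_state=seed)
```

Replacing `loss="linear"` with `loss="exponential"` gives the alternative loss function mentioned in the text.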


2.3.1 Mean and median regression 


We also explore a type of tree based machine learning
architecture called Quantile regression, which can include the
use of the median value, as opposed to the mean value, when
constructing the loss function for regression trees. We use
the scikit-learn package called GradientBoostingRegressor
(Friedman 1999, 2001), which accepts a parameter to determine
the type of loss function, for example ‘least squares’
corresponding to mean regression, and ‘quantile’ with a
corresponding value of 50% for median regression. For both mean
and median regression we again fix the hyper-parameters of
the machine learning architecture to be the same as that of
the section above, except for the choice of loss function.
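
A minimal sketch of the two configurations, assuming the same tree settings as in the previous section (100 trees, `min_samples_leaf=10`). The helper names are hypothetical. The mean regressor relies on the default loss of `GradientBoostingRegressor`, which is least squares; the string used to name that loss explicitly has changed between scikit-learn releases, so it is deliberately not spelled out here.

```python
from sklearn.ensemble import GradientBoostingRegressor

def make_mean_regressor(seed=0):
    """Mean regression: the default loss of the estimator is least squares."""
    return GradientBoostingRegressor(n_estimators=100, min_samples_leaf=10,
                                     random_state=seed)

def make_median_regressor(seed=0):
    """Median regression: quantile loss evaluated at the 50% quantile."""
    return GradientBoostingRegressor(loss="quantile", alpha=0.5,
                                     n_estimators=100, min_samples_leaf=10,
                                     random_state=seed)
```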

The loss function L(u) is the quantity that the learning
algorithm minimises to find the best fitting model parameters. For
trees the best fitting parameters can be the numerical values
along the feature dimensions at which a split is chosen.
The mean regression loss function is the least squares loss
function given by

$$ L(u) = \frac{1}{N} \sum_{i=1}^{N} (y_i - u)^2 \qquad (2) $$

where the sum runs over each of the N data y_i on each leaf
node on the tree. The least squares loss function is sensitive
to outliers, and so we would expect it to be more affected by
outliers in the training set. For median regression the loss
function is given by

$$ L(u) = \frac{1}{N} \left( - \sum_{y_i < u} (y_i - u) + \sum_{y_i \geq u} (y_i - u) \right) = \frac{1}{N} \sum_{i=1}^{N} |y_i - u| \qquad (3) $$

which is less sensitive to outliers because of the linear
dependence on the differences between values y_i and u. We
compare the results of training these different architectures on
contaminated, and outlier removed, data samples in §3.4.1.
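
The different sensitivity of the two loss functions can be checked numerically. The sketch below evaluates both losses for a constant prediction u on a small set of leaf values containing one outlier; the function names, the toy values, and the grid search over u are illustrative assumptions only.

```python
import numpy as np

def least_squares_loss(y, u):
    """Mean regression loss: (1/N) * sum_i (y_i - u)^2."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - u) ** 2)

def absolute_loss(y, u):
    """Median regression loss: (1/N) * sum_i |y_i - u|."""
    y = np.asarray(y, dtype=float)
    return np.mean(np.abs(y - u))

def best_constant(y, loss, grid):
    """Value of u on `grid` that minimises `loss` for the data y."""
    losses = [loss(y, u) for u in grid]
    return grid[int(np.argmin(losses))]

# Four consistent redshifts and one anomalous value on the same leaf.
leaf_values = [0.10, 0.11, 0.12, 0.13, 0.94]
grid = np.linspace(0.0, 1.0, 1001)
u_mean = best_constant(leaf_values, least_squares_loss, grid)
u_median = best_constant(leaf_values, absolute_loss, grid)
```

The least squares minimiser is the mean of the leaf values and is pulled towards the outlier, whereas the minimiser of the absolute loss is the median and stays with the bulk of the data.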


2.4 Self Organising Maps 


Another popular machine learning architecture is the Self
Organising Map (Kohonen 1997, hereafter SOM), which
has recently been used for redshift estimation (Geach
2012). SOMs are also being used in combination with template
fitting routines for photometric redshifts (Greisel et
al. in prep.). We use the public implementation of a SOM,
called SOMz (Carrasco Kind & Brunner 2014b), which we
briefly describe below. We refer the reader to Carrasco Kind
& Brunner (2014b) for more details. We choose to include
SOMz in this paper because it represents a very different
machine learning architecture from the tree based methods.
Using both SOMs and trees indicates how generalisable
the results found in this paper are.

2 statweb.stanford.edu/~tibs/ElemStatLearn

SOMz combines neural networks with dimensionality
reduction and similarity clustering. The SOMs are evolved
from random starting weights such that training examples
with similar high dimensional inputs appear clustered in a
two dimensional space of pixels. The map evolution is
unsupervised because it is performed by only examining the
input features. Once the SOMz is stable, the training
examples are again passed through the map, and the values of
the output feature are combined to produce an output value
for each pixel. New data are passed through the SOMz and
the pixel, or nearby pixels, which have the largest activation
values contribute to the predicted value returned.

In this work we choose to fix the hyper-parameters of
the SOMz to have a spherical map geometry with 768
(= 12x8x8) pixels and we perform 100 training iterations. For a
full analysis on the effect of using different hyper-parameters
with SOMz see Carrasco Kind & Brunner (2014b). Again we
mention that tuning these hyper-parameters can lead to a
large improvement; however, this is not the focus
of this work.


3 ANALYSIS AND RESULTS 

We first introduce the anomaly detection method to both
identify the inserted contaminating galaxies with unreliable
redshifts, and then to build a cleaner training sample. We
then provide a method to estimate the contamination fraction
of a dataset. Finally we train separately on both the
full contaminated sample and the cleaned sample, and show
the effect on the measured statistics of the machine learning
redshift as calculated on an independent and single cross
validation sample.


3.1 Anomaly identification 

We examine the ability of the Elliptical Envelope method
to correctly identify the galaxies with unreliable redshift
estimates that we use to contaminate the training sample.
We perform more than 250 sets of independent analyses
for both the Adaboost and SOMz machine learning
architectures. We initially randomly select a number N_ur, where
100 < N_ur < 6730, of galaxies with unreliable redshifts, and
combine them with N_r randomly selected galaxies with reliable
redshifts from the base sample. We restrict N_r to values
3 N_ur < N_r < 100k.
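
The random construction of one combined sample can be sketched as follows. The function names, the uniform sampling of N_ur and N_r within the stated limits, and the array layout (one row per galaxy) are assumptions made for illustration; the paper only states the allowed ranges.

```python
import numpy as np

def draw_sample_sizes(rng, n_unreliable_available=6730, max_reliable=100_000):
    """Draw N_ur with 100 < N_ur < 6730 and N_r with 3 N_ur < N_r < 100k."""
    n_ur = int(rng.integers(101, n_unreliable_available))  # upper bound exclusive
    n_r = int(rng.integers(3 * n_ur + 1, max_reliable))    # upper bound exclusive
    return n_ur, n_r

def build_contaminated_sample(rng, reliable, unreliable, n_r, n_ur):
    """Stack randomly chosen reliable and unreliable rows and label them."""
    r_idx = rng.choice(len(reliable), size=n_r, replace=False)
    ur_idx = rng.choice(len(unreliable), size=n_ur, replace=False)
    features = np.vstack([reliable[r_idx], unreliable[ur_idx]])
    is_contaminant = np.concatenate([np.zeros(n_r, dtype=bool),
                                     np.ones(n_ur, dtype=bool)])
    return features, is_contaminant
```

Repeating the two calls with a fresh draw each time yields experiments with different sample sizes and different inserted contamination fractions, N_ur / (N_ur + N_r).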


github.com/mgckind/MLZ

Figure 2. The percentage of correctly identified outlier galaxies
as a function of the hyper-parameter n_c, which measures the input
contamination fraction best guess. The dispersion of data at fixed
n_c is due to the different randomly selected combined samples of
random size. The dark lines show the mean of the distribution
and the upper and lower shaded regions show the 68% spread of
the data. The black error bar shows the actual range of
contamination fractions, which correspond to the number of galaxies
with unreliable redshifts inserted into the base sample. Each of the
250 experiments has a different inserted contamination fraction.


We then construct a training and cross validation sample
from this combined sample and perform feature normalization
on all of the features. Throughout this paper we ensure
that the cross validation sample is only drawn from
those galaxies with a reliable redshift estimate, because it
would be irrelevant to try to predict the redshift of galaxies
with unreliable redshift estimates. For this training sample
we explore a range of values of the hyper-parameter n_c,
ranging from 10^-5 < n_c < 0.5, corresponding to different
initial ‘best guesses’ of the expected contamination fraction
as used by the Elliptical Envelope routine.
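
A minimal sketch of the normalisation step and of a grid of n_c values spanning the quoted range. The text does not specify the normalisation scheme or how the n_c values are spaced, so the use of `StandardScaler` (zero mean, unit variance, fitted on the training sample only) and of logarithmic spacing are assumptions, as are the function names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def normalise_features(train, validation):
    """Scale each feature to zero mean and unit variance (assumed scheme)."""
    scaler = StandardScaler().fit(train)
    return scaler.transform(train), scaler.transform(validation)

def contamination_grid(n_values=20, low=1e-5, high=0.5):
    """Logarithmically spaced n_c values between `low` and `high` (assumed spacing)."""
    return np.logspace(np.log10(low), np.log10(high), n_values)
```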

For each value of n_c the anomaly detection code produces
a classification of either ‘inlier’ or ‘outlier’ for each
galaxy. We determine the percentage of correctly identified
outlier galaxies which have an unreliable redshift, and also
the percentage of potentially incorrectly identified galaxies
with a reliable redshift. We note, however, that a galaxy with a
reliable redshift may still be an outlier along a feature
dimension other than the spectroscopic redshift.
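
The two percentages described above can be computed from the outlier flags and the known contamination labels. This is a small illustrative helper; the function and argument names are hypothetical.

```python
import numpy as np

def outlier_recovery(flagged, is_contaminant):
    """Return (percentage of contaminants flagged as outliers,
               percentage of reliable galaxies flagged as outliers)."""
    flagged = np.asarray(flagged, dtype=bool)
    is_contaminant = np.asarray(is_contaminant, dtype=bool)
    pct_contaminants_flagged = 100.0 * flagged[is_contaminant].mean()
    pct_reliable_flagged = 100.0 * flagged[~is_contaminant].mean()
    return pct_contaminants_flagged, pct_reliable_flagged
```

For instance, if two of three inserted contaminants and one of three reliable galaxies are flagged, the function returns approximately (66.7, 33.3).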

In Fig. 2 we show the percentage of correctly identified
galaxies with unreliable redshifts as a function of the
contamination hyper-parameter n_c. The dispersion of data at
fixed n_c is due to the different randomly selected combined
samples of random size. The dark lines show the mean of the
distribution and the upper and lower shaded regions show
the 68% spread of the data. The black error bar shows the
actual range of contamination fractions, which correspond to
the number of galaxies with unreliable redshifts inserted into
the base sample. Each of the 250 experiments has a different
inserted contamination fraction.

We find that the fraction of data with unreliable redshifts
which is classified as anomalous, or an outlier, is between
one and two orders of magnitude larger than the
corresponding fraction of data with reliable redshifts. This
demonstrates the success of the Elliptical Envelope technique
in identifying data with unreliable redshift estimates.

Figure 3. Top panel: The transparent circles show the redshift
and apparent magnitude distribution of the base sample
contaminated with unreliable redshifts. The blue stars show which of
those galaxies are classified as being outliers using the Elliptical
Envelope technique for a given contamination hyper-parameter
value n_c = 0.1. Bottom panel: The absolute difference between
the reliable and unreliable redshifts for the contaminating galaxies
which are not classified as being outliers by the Elliptical
Envelope technique as a function of increasing n_c.

We next explore which of the contaminating data with
unreliable redshifts is classified as anomalous, and show
projections through the data in Fig. 3. In the top panel of
Fig. 3 the transparent circles show the redshift and apparent
magnitude distribution of the base sample contaminated
with unreliable redshifts. The blue stars show which
of those galaxies are classified as being outliers using the
Elliptical Envelope technique for a given contamination
hyper-parameter value n_c = 0.1. The bottom panel of Fig. 3
concentrates on those contaminating galaxies with both a
reliable and unreliable redshift. The panel shows the absolute
difference between the reliable and unreliable redshifts for
the contaminating galaxies which are not classified as outliers
by the Elliptical Envelope technique as a function of
increasing n_c.

The top panel of Fig. 3 shows that galaxies which occupy
a region of redshift and r band apparent magnitude
space which is very different from the majority of other
galaxies are classified as being anomalous. We also note that
a small fraction of galaxies which occupy the same region of
redshift and r band apparent magnitude space as the majority
of galaxies are also classified as anomalous. This could be
because the data are anomalous along one or more different
feature dimensions, which is not easily viewed in this two
dimensional projection. There are three distinct clouds of
data with reliable redshifts in the top panel. These clouds
of data correspond to the different observing phases of the
SDSS.

The bottom panel of Fig. 3 shows that the number of
galaxies with unreliable redshifts which are not classified as
anomalous decreases as the contamination fraction
hyper-parameter n_c increases. We also note that the most extreme
examples of galaxies with very anomalous unreliable redshifts
are preferentially removed as n_c increases. The sharp
drop at the x-axis location of 0.01 is due to the construction
of the contaminating data sample.

In the top panel of Fig. 3 there are distinct clouds of
data in these feature projections. These are due to the different
observing strategies of the SDSS. For example most
of the faint, high redshift cloud was observed in SDSS III,
while the lower redshift clouds were observed in SDSS I/II.
We also perform outlier detection separately for these samples,
and find the following similar trend in both samples:
the fainter the galaxy is in the r band, the more likely it is
to have an anomalously large unreliable redshift. This can
be understood by the fainter galaxies being more difficult
to observe spectroscopically and requiring larger integration
times.

We have also explored the use of One Class Support
Vector Machines (Cortes & Vapnik 1995) as the machine
learning anomaly detector, but do not find an improvement
over the results using the Elliptical Envelope method. This
suggests that a hyper dimensional ellipse provides a good
model to enclose, and therefore identify, the non-anomalous
data.


3.2 The distribution of data with anomalies 
removed 

We explore how the distribution of galaxies changes as a
function of the contamination hyper-parameter n_c, as compared
to the initial sample. We construct a sample of size
100k which is contaminated with 3k galaxies with unreliable
redshifts.

We perform anomaly detection on the contaminated
sample for different values of n_c. In Fig. 4 we show the
distribution of spectroscopic redshift against apparent magnitude
in the r band, for three different values of n_c indicated in
each panel. The combined sample in each case is shown by
the solid lines, and the sample with anomalous outliers removed
is shown by the thick dotted lines.

Fig. 4 shows that as the contamination hyper-parameter
increases above n_c ≈ 0.01, the two distributions of galaxies
become biased with respect to each other. For small values
of n_c the distributions are mostly unaffected. If there is no
anomalous data, and the Elliptical Envelope routine is expecting
a large fraction of contaminated data, then even
clean data is removed; however, if anomalous data is indeed
present, then the routine will detect it. This behavior can
also be seen in Fig. 2.

© 2010 RAS, MNRAS 000, 1-??

Anomaly detection for machine learning redshifts 7

Figure 4. The distribution of training galaxies as a function of the contamination hyper-parameter n_c. We show the full sample by
the solid lines, and the sample with ‘anomalous’ galaxies removed by the dashed line. Each panel shows the change in the distributions
when using a data sample of size 100k which has been contaminated with 3k galaxies with unreliable redshifts.

In the next section we derive a prescription to estimate 
the contamination fraction from a base data sample that 
may be contaminated. 

3.3 Estimating the contamination fraction 

We next provide a prescription to make an empirical initial
estimate for the contamination fraction. We note that the
Elliptical Envelope method is not very sensitive to the exact
value of the contamination fraction, as shown in Fig. 2, and
therefore we are interested in obtaining an order of
magnitude estimate. We use the measured values of the
Mahalanobis distance, $d_{mh}$, to estimate the contamination
rate.

To make this analysis more realistic we construct a base
sample and a contaminated sample with more stringent
selection criteria on the allowed photometric and
spectroscopic errors. We select galaxies which pass the
following selection criteria: measured errors in the r, g, i
bands satisfying 0 < error < 0.2, spectroscopic redshifts
greater than 0, and spectroscopic redshift errors satisfying
0 < error < 0.001. This reduces the base sample with reliable
redshifts to 2.1M and the sample with unreliable redshifts
to 3017.
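The selection described above can be sketched as a simple row filter. This is only an illustration: it assumes each galaxy is a dictionary keyed by the column names of the query in the appendix (`modelMagErr_g`, `modelMagErr_r`, `modelMagErr_i`, `specz`, `specz_err`), that the "measured errors" are the model magnitude errors, and the example rows are made up.

```python
def passes_selection(row, max_mag_err=0.2, max_specz_err=0.001):
    """Return True if a galaxy passes the stricter cuts described in the text.

    Assumption: the photometric errors are the modelMagErr_* columns.
    """
    mag_errs_ok = all(
        0.0 < row[f"modelMagErr_{band}"] < max_mag_err for band in ("g", "r", "i")
    )
    specz_ok = row["specz"] > 0.0
    specz_err_ok = 0.0 < row["specz_err"] < max_specz_err
    return mag_errs_ok and specz_ok and specz_err_ok


# Hypothetical rows, for illustration only.
galaxies = [
    {"modelMagErr_g": 0.05, "modelMagErr_r": 0.03, "modelMagErr_i": 0.04,
     "specz": 0.31, "specz_err": 0.0002},   # passes every cut
    {"modelMagErr_g": 0.35, "modelMagErr_r": 0.03, "modelMagErr_i": 0.04,
     "specz": 0.31, "specz_err": 0.0002},   # g-band error too large
    {"modelMagErr_g": 0.05, "modelMagErr_r": 0.03, "modelMagErr_i": 0.04,
     "specz": 0.31, "specz_err": 0.0100},   # redshift error too large
]

selected = [g for g in galaxies if passes_selection(g)]
print(len(selected), "of", len(galaxies), "galaxies pass the cuts")
```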

For this analysis we construct 250 datasets, which again
contain a random amount of data with unreliable redshifts,
and a random sample of base data with reliable redshifts.
We use the Elliptical Envelope technique with a range of
contamination fractions $0.001 < n_c < 0.5$, to measure
$d_{mh}$ of the data for each value of $n_c$. We note that the
dimensionality of the input feature space, $N_d$, is 8, as
described earlier. We then assign the class ‘outlier’ to data
that satisfies $d_{mh} > N_{alg}$. We find that the choice of
$N_{alg} = 2^{N_d}$ provides a good estimation for the outlier
fraction, and discuss the robustness of this value below.
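A minimal sketch of this rule of thumb follows, reading the threshold as $2^{N_d}$ (which matches the range 88 to 6560 quoted for $N_d = 8$). It assumes the mean vector and inverse covariance matrix are already available, for example from a robust covariance fit; the function names and the toy two dimensional points are hypothetical, and the distance is taken to be the non-squared Mahalanobis distance.

```python
import math


def mahalanobis_distance(x, mean, precision):
    """Mahalanobis distance of one point, given a mean and an inverse covariance."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    quad = sum(
        diff[i] * precision[i][j] * diff[j]
        for i in range(len(diff))
        for j in range(len(diff))
    )
    return math.sqrt(quad)


def estimate_contamination(points, mean, precision):
    """Fraction of points with d_mh > 2**N_d, where N_d is the feature dimension."""
    n_dim = len(mean)
    threshold = 2.0 ** n_dim
    flagged = sum(
        mahalanobis_distance(p, mean, precision) > threshold for p in points
    )
    return flagged / len(points)


# Toy two dimensional example with unit covariance: the threshold is 2**2 = 4.
mean = [0.0, 0.0]
precision = [[1.0, 0.0], [0.0, 1.0]]
points = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [5.0, 0.0], [0.0, -4.5]]

fraction = estimate_contamination(points, mean, precision)
print(f"estimated contamination fraction: {fraction:.2f}")
```

In this toy case only the two points lying more than four units from the mean are counted as contaminating data.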

Fig. 5 shows the fractional contamination rate of data
with unreliable redshifts inserted into the base sample
against the estimated contamination fraction using the
Mahalanobis distance $d_{mh}$. The error bars are inflated by
a factor of 10, and show the 68% spread of results using
different values of the contamination hyper-parameter $n_c$,
when using the Elliptical Envelope technique to measure
$d_{mh}$.

We note that a large range of threshold values
$1.75^{N_d} < N_{alg} < 3^{N_d}$, corresponding to
$88 < N_{alg} < 6560$, also produces reasonable ‘order of
magnitude’ estimates of the inserted contamination fraction.
As an illustrative example we could compare this result to
the case of a two dimensional Gaussian distribution of width
$\sigma$; this relationship is equivalent to assigning the
classification of outlier to data that is more than $4\sigma$
away from the mean value.

Figure 5. The fractional contamination rate of data with
unreliable redshifts inserted into the base sample, against the
estimated contamination fraction using the Mahalanobis distance
$d_{mh}$. We define contaminated galaxies as those that satisfy
$d_{mh} > 2^{N_d}$ for the $N_d$ feature dimensions of the data
sample. The error bars are inflated by a factor of 10, and show
the 68% spread of results using different values of the
contamination hyper-parameter $n_c$, when using the Elliptical
Envelope technique to measure $d_{mh}$.

3.4 Machine learning redshifts from anomaly 
removed training data 

We next present the effect on the machine learning redshift
if we train only on the training sample with anomalous data
removed, instead of training on the full contaminated
sample. We remove anomalous data using the Elliptical
Envelope technique. We choose to use Adaboost and SOMz in
independent sets of analyses.

In each set of analyses we first train on the contaminated
training sample, and then use the Elliptical Envelope
method with a fixed contamination fraction hyper-parameter
$n_c$, to remove anomalous data, irrespective of
whether or not they are drawn from the sample with reliable
or unreliable redshift estimates. This produces a cleaned
training set, which we then independently train on. We refer
to this as the ‘cleaned’ training sample in what follows.
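The cleaning step can be illustrated with a simplified stand-in: given one precomputed Mahalanobis distance per training galaxy, drop the fraction $n_c$ with the largest distances. The robust covariance fit performed by the Elliptical Envelope method is not reproduced here, and the function name and the distances are hypothetical. Note that the reliable or unreliable label of a galaxy is never consulted.

```python
def clean_training_set(distances, n_c):
    """Return the indices kept after dropping the n_c fraction with largest distance.

    Assumption: `distances` holds one precomputed Mahalanobis distance per galaxy.
    """
    n_remove = int(round(n_c * len(distances)))
    # Indices ordered from the most to the least anomalous galaxy.
    order = sorted(range(len(distances)), key=lambda i: distances[i], reverse=True)
    removed = set(order[:n_remove])
    return [i for i in range(len(distances)) if i not in removed]


# Hypothetical distances for ten training galaxies.
distances = [0.5, 1.0, 9.0, 0.7, 1.2, 0.9, 0.4, 1.1, 0.8, 7.5]
kept = clean_training_set(distances, n_c=0.2)
print("kept indices:", kept)
```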

We construct a cross-validation sample drawn from
galaxies with reliable spectra. To make a fair comparison
later, we do not modify the cross-validation sample at all,
irrespective of whether its galaxies would be classed as
inliers or outliers. We then pass the same cross-validation
sample through both learned systems, and obtain a machine
learning redshift estimate z for each galaxy.

We construct the redshift scaled residual vector
$\Delta_{z'} = (z - \mathrm{specz})/(1 + \mathrm{specz})$ and
measure the following metrics: $|\mu|$, $\sigma_{68}$,
$\sigma_{95}$, corresponding to the median value of
$\Delta_{z'}$, and the values corresponding to the 68% and 95%
spread of $\Delta_{z'}$, and we additionally measure the
‘outlier rate’ defined as the fraction of galaxies for which
$|\Delta_{z'}| > 0.15$. Note that the outlier rate here has a
different, albeit related, definition from the anomaly
detection sections. We repeat this analysis for Adaboost and
SOMz, and then repeat the entire analysis for a different
value of $n_c$. We perform 250 sets of experiments, each with
a randomly selected initial training sample of data with
reliable and unreliable redshifts, and with a randomly
selected cross-validation sample.
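The metrics above can be computed along the following lines. This is a sketch under stated assumptions: the 68% and 95% spreads are taken to be half the width of the central 68% and 95% percentile intervals, which the text does not spell out, and the helper names and example redshifts are hypothetical.

```python
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def redshift_metrics(z_ml, z_spec, outlier_cut=0.15):
    """Summary statistics of the scaled residual (z - specz) / (1 + specz)."""
    delta = sorted((zm - zs) / (1.0 + zs) for zm, zs in zip(z_ml, z_spec))
    return {
        "abs_median": abs(statistics.median(delta)),
        # Assumed definition: half-width of the central percentile interval.
        "sigma_68": 0.5 * (percentile(delta, 84.0) - percentile(delta, 16.0)),
        "sigma_95": 0.5 * (percentile(delta, 97.5) - percentile(delta, 2.5)),
        "outlier_rate": sum(abs(d) > outlier_cut for d in delta) / len(delta),
    }


# Hypothetical spectroscopic and machine learning redshifts.
z_spec = [1.0, 1.0, 1.0, 1.0, 1.0]
z_ml = [1.0, 1.2, 0.8, 2.0, 1.0]
metrics = redshift_metrics(z_ml, z_spec)
print(metrics)
```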

In Fig. 6 we show the percentage relative improvement
when training on the anomaly cleaned sample instead of
the initial contaminated sample on each of the measured
statistics, as a function of the hyper-parameter $n_c$. In the
left hand panel we show the results of the analysis with
Adaboost, and in the right hand panel we show the results
with SOMz. The lines and shaded regions again correspond
to the median and 68% of the distribution.
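As a worked example of a percentage relative improvement, the snippet below applies one plausible definition, 100 × (contaminated − cleaned) / contaminated, to two pairs of values quoted in Table 1 at $n_c = 0.07$. The exact definition used for the figure is an assumption here, and the function name is hypothetical.

```python
def relative_improvement(contaminated, cleaned):
    """Percentage by which a metric shrinks when training on the cleaned sample.

    Assumed definition: 100 * (contaminated - cleaned) / contaminated.
    """
    return 100.0 * (contaminated - cleaned) / contaminated


# Values taken from Table 1: SOMz sigma_68 and the Adaboost outlier rate (in %).
somz_sigma68 = relative_improvement(0.077, 0.032)
adaboost_outlier = relative_improvement(4.12, 2.98)

print(f"SOMz sigma_68 improvement:      {somz_sigma68:.1f}%")
print(f"Adaboost outlier rate improvement: {adaboost_outlier:.1f}%")
```

Under this definition both values fall inside the 20% to 80% range quoted in the text.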

In both sets of analysis we find that for very small
values of $n_c < 0.001$, corresponding to a removal of 1% of
data with unreliable redshifts, and 0.05% of data with
reliable redshifts (see Fig. 2), there is a small improvement
in the measured metrics at the level of a few percent or less.
As $n_c$ increases to 0.07, corresponding to a removal of
70% of unreliable data and 3% of reliable data, the
improvement in the metrics for both machine learning systems
lies between 20% and 80%. The metrics most affected by the
removal of anomalous data are the median values, and the
tails of the distribution, namely $\sigma_{95}$ and the
outlier fraction $|\Delta_{z'}| > 0.15$. Fig. 6 shows that
there is a slight to moderate decline in improvement of the
metrics at larger values of $n_c$. This degradation in
improvement can be understood by examining the effect of
large $n_c$ on the resulting distributions of training
galaxies as seen in Fig. 4. For larger values of $n_c$ the
cleaned samples become less representative of the initial
sample, and therefore the training and test sets become less
representative of each other, and the machine learning
mapping extends into the realm of extrapolation.
Extrapolating outside of the training set leads to spurious
and degrading results, as seen here.

Fig. 6 shows the relative improvement for each of the
two machine learning techniques. We also perform a
comparison between these two machine learning architectures
and show the results in the top two rows of Table 1. We
note that this is not the main objective of this work because
similar comparisons have already been performed (e.g.,
Carrasco Kind & Brunner 2014b). The table shows the machine
learning architecture used and the effect of training on both
the data sample that is contaminated with unreliable
redshifts, and the data sample with outliers removed using the
Elliptical Envelope technique. We show the measured
statistics in the final four column headings. The quoted
values are the median values at fixed $n_c = 0.07$ of the 250
samples that each have a different inserted contamination
fraction. We note that Adaboost outperforms the SOMz
algorithm on all metrics by a factor of > 2 when training on
the contaminated samples, and is comparable with or
outperforms the SOMz algorithm when training on the cleaned
samples. We have chosen to show the results obtained for a
contamination hyper-parameter value $n_c = 0.07$, but note
the same behavior is found for all values of $n_c$.

We note that both panels of Fig. 6 show improvement
as the base sample is cleaned of contaminating data. This
shows that the machine learning routines for which the
improvement is the greatest are the least robust techniques to
use when presented with some fraction of anomalous training
data. We further explore other techniques which are less
susceptible to anomalous training data in §3.4.1.

During this work we assume that the cross validation
sample is not contaminated with anomalous data, which is
true by construction. However this may not be true of other
data sets. In such cases one could perform anomaly detection
on the training, cross validation, and test sets to
remove outliers from the full sample. If the same anomaly
detection results were applied to a final test sample, this
would result in a fair analysis. However one would need to
check that this preprocessed data is suitable for the final
science application at hand. One further method would be
to identify anomalous cross validation data, and then
investigate these data to understand why they have been so
classified.

3.4.1 Mean vs Median regression

We next explore the machine learning architectures called
mean regression and quantile, or median, regression. Quantile
regression can use the median, as opposed to the mean value,
when constructing the loss function for boosted regression
trees. We expect median regression to be less strongly
affected by contamination in the training data. For comparison
with §3.2 using Adaboost, we construct very similar machine
learning architectures using the same hyper-parameters and
only vary the loss function. We continue by applying the
same formalism as before: we first train on the contaminated
data sample, and then use the Elliptical Envelope method to
remove outlier data, and finally retrain on the cleaned data
sample. We show the results of using mean regression in the
left hand panel of Fig. 7, and the results using median
regression in the right hand panel. Again, the actual spread
of the inserted contamination fraction using the galaxy
sample with unreliable redshifts is shown by
the black starred data point and error bar.
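The expectation that median regression is less affected by contamination can be seen in a toy example. A constant prediction that minimises the squared error is the mean of the targets, while one that minimises the absolute error is their median. The redshift values below are hypothetical and stand in for the targets falling in a single leaf of a regression tree.

```python
import statistics

# Hypothetical training redshifts in one region of feature space.
clean = [0.10, 0.11, 0.12, 0.10, 0.11]
# The same sample with one grossly incorrect spectroscopic redshift added.
contaminated = clean + [2.50]

# Shift of the squared-error minimiser (the mean) caused by the bad redshift.
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
# Shift of the absolute-error minimiser (the median) caused by the bad redshift.
median_shift = statistics.median(contaminated) - statistics.median(clean)

print(f"mean shifts by   {mean_shift:.3f}")
print(f"median shifts by {median_shift:.3f}")
```

A single anomalous target moves the mean by about 0.4 in redshift, while the median is unchanged in this example.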

We find that both machine learning architectures show
large improvement in the measured statistics $|\mu|$ and
$|\Delta_{z'}| > 0.15$ when the data sample is pre-cleaned
using the Elliptical Envelope technique. This again shows how
poorly the base routines perform on anomalous training data.
As expected from the effect of outliers on the loss functions,
we find that mean regression is more affected by
contamination than median regression. We note that the
dispersion measures $\sigma_{68}$ and $\sigma_{95}$ are very
well controlled for the median regression architecture. We
show the absolute values for each of the measured metrics in
the third and fourth rows of Table 1. We again show the values
of each of the measured statistics, averaged over the 250
samples, for a chosen value of the contamination
hyper-parameter $n_c = 0.07$.

Figure 6. The percentage relative improvement when training on the cleaned sample instead of the full sample on each of the measured
statistics, as a function of the hyper-parameter $n_c$. In the left hand panel we show the analysis with Adaboost, and in the right hand
panel we show the improvement with SOMz. The black error bars show the actual range of contamination fractions, which correspond
to the number of galaxies with unreliable redshifts inserted into the base sample. Each of the 250 experiments has a different inserted
contamination fraction.

Figure 7. These panels are similar to those in Fig. 6 and also include a data sample contaminated with unreliable redshifts. In the
left hand panel we show the analysis with mean regression, and in the right hand panel we show the relative improvement using median
regression. The black error bars show the actual range of contamination fractions, which correspond to the number of galaxies with
unreliable redshifts inserted into the base sample. Each of the 250 experiments has a different inserted contamination fraction.

Comparing quantile and median regression with the
SOMz and Adaboost routines is not the primary focus of
this work (see e.g., Dietterich 2000; Caruana &
Niculescu-Mizil 2005), but we note that Adaboost with decision
trees for regression is the best performing machine learning
architecture on all measured statistics. However the continued
success of Adaboost with contaminated data appears to be
in disagreement with studies that include a large fraction
of label noise in classification tasks (Dietterich 2000). This
may be an artifact of the chosen datasets and how noise is
added to the data.


3.4.2 Anomaly detection using a cleaner galaxy sample

In the previous sections we use data samples with very
relaxed selection criteria, which allows both photometric and
spectroscopic data with large measured errors to be included
in the base sample. We now examine the effect on the machine
learning redshift if one chooses to use a base galaxy
sample which has much more stringent limits on the allowed
magnitude of both photometric and spectroscopic errors. We
again select galaxies which pass the selection criteria
described in §3.3.

We repeat the above analysis by again contaminating
the base sample and using the Elliptical Envelope method
to clean the sample, and then train Adaboost and SOMz for
redshift analysis on both contaminated and cleaned samples.
We again find a similar distribution of improvements
in the redshift metrics as a function of the contamination
hyper-parameter $n_c$, but with a slightly reduced amplitude.
The improvement for Adaboost ranges from 15% for the
outlier fraction, to 85% for the median value, and the
improvement for SOMz ranges from 40% for $\sigma_{68}$, to 95%
for the median value.


3.4.3 Anomaly detection of non-contaminated galaxies


We also examine the effect on the machine learning redshift
if one uses only the base galaxies with a reliable
spectroscopic redshift, without the addition of galaxies with
unreliable redshifts. We continue as before by determining
inliers and outliers as a function of the hyper-parameter
$n_c$. In this section ‘anomalous data’ could mean that a
photometric magnitude in a particular band is very different
from other similar galaxies at that redshift.

We proceed by again separately training Adaboost and
the SOMz on both the initial training set and the cleaned
training set. We present the results of this analysis in
Fig. 8. Note that the y-axis scale is different between
panels, and we have not shown $|\mu|$ due to the large scatter
seen on this metric, caused by $|\mu|$ being very small.

If we adopt a contamination fraction hyper-parameter
of $n_c < 0.005$ and remove anomalous data, we find a very
slight improvement in the measured metrics, at the level of
~1% using Adaboost and up to 4% using SOMz. Note
that the relative error on $|\mu|$ is unstable, although
$|\mu|$ does remain small. As $n_c$ increases, SOMz continues
to benefit from a cleaned training sample, whereas Adaboost
begins to degrade in its predictive power.

The degradation in the measured statistics seen at large
values of $n_c$ in Fig. 8 can be attributed to the removal of
representative training data as seen in Fig. 4. Recall that
the validation set is a random sample from the uncontaminated
data with reliable redshifts, and thus would more closely
resemble the solid lines in Fig. 4. For increasing values of
$n_c$, the training and validation samples become less
representative of each other and a machine learning system
would naturally degrade. We do note that SOMz appears to be
less affected by small differences in the training and test
data sets, but also degrades in predictive ability once the
samples become very unrepresentative.

In the last two rows of Table 1 we quote the median
values on each of the measured statistics from each of the
samples when both training on, and further cleaning, these
uncontaminated galaxy samples. We note that the effect of
training on the further cleaned sample improves the measured
statistics using SOMz by a few percent, but can degrade
some of the measured statistics by a few percent when
using the Adaboost algorithm with decision trees.

An interesting future application which is being explored
by the authors is to trim the anomalous data and
then apply data augmentation (see Hoyle et al. 2015)
techniques to make the training and test samples again more
representative.


4 CONCLUSIONS 


Machine learning methods can be used to assign redshift
estimates to photometrically selected galaxy catalogues if
a representative training set with both photometric
properties or ‘features’ and spectroscopic redshifts exists.
Machine learning methods require that the base training sample
which is used to learn the mapping between these quantities
is representative of the final, or ‘test’, data sample. This
requires that the training sample spans a similar input
photometric feature space as the test sample, and does not
contain anomalous data (e.g., galaxies with incorrect
spectroscopic redshifts), otherwise an incorrect mapping will
be learnt. In this work we examine the ability of machine
learning architectures to identify and remove such anomalous
data.

In contrast to previous work on outlier analysis which
removes anomalous data after the machine learning redshift
system has been trained (e.g., Schneider et al. 2006;
Bernstein & Huterer 2010; Carrasco Kind & Brunner 2014a), the
method presented here identifies anomalous data before the
sample is used to estimate a redshift. The benefit of this
approach is that this pre-cleaning can then be used to define
a new input feature space which is much less complex than
using the post processing methods. Our method makes it
easier to construct a final sample of test galaxies.

The analysis in this paper uses a base sample of 2.5M
galaxies drawn from the SDSS DR12 which have reliably
measured spectroscopic redshifts, and some of which also
have an unreliably measured spectroscopic redshift. We
construct an ‘anomalous data sample’ by selecting galaxies
whose reliable and unreliable redshifts differ by more than
0.01, and proceed by assigning the unreliable redshift to the
galaxy. We apply this selection because we do not expect the
recovered photometric redshift to have an error below 0.01.
We contaminate a base data sample with data from the
anomalous sample, and then use machine learning to identify
the anomalous data.


We choose the Elliptical Envelope routine (Rousseeuw
& Driessen 1999; Hubert & Debruyne 2010) as the machine
learning anomaly detector algorithm. The resulting
ellipse encompasses a fraction of the data which are
classified as ‘inliers’ and data outside of the ellipse are
classified as ‘outliers’ or anomalous data. We explored an
alternative machine learning architecture for anomaly
detection called One Class Support Vector Machines (Cortes &
Vapnik 1995) and found that the Elliptical Envelope routine is
more suitable for our dataset. This implies that the high
dimensional data cloud is well described by a hyper-ellipse,
rather than a hyper-surface with distinct regions of reliable
and unreliable data which would be analysed more favourably
using Support Vector Machines. There is one hyper-parameter
of the Elliptical Envelope routine which is the a priori
assumed contamination fraction of the data set. We describe a
method to estimate this fraction using a rule-of-thumb
relation between the distributions of Mahalanobis distances
and the number of feature dimensions, but note the results are
not very sensitive to the actual value assumed.

We show how the removal of this anomalous data improves
the machine learning redshift metrics for two different
groups of machine learning architectures. We choose to
explore both decision tree based methods and artificial Neural
Network based Self Organizing Maps. These very different
architectures suggest that the results found here are
generalisable, and not an artefact of the machine learning
method chosen. We train the machine learning systems to
estimate redshifts for a test sample separately on data from
the base sample contaminated with unreliable redshift
estimates, and with the cleaned base sample once anomalous
data has been removed.

Figure 8. These panels are similar to those in Fig. 6 but with an initial data sample that does not contain galaxies with unreliable
redshift estimates. Note that the y-axis scale is different between panels, and we have not shown $|\mu|$.

Algorithm          Sample                           $\langle|\mu|\rangle$  $\langle\sigma_{68}\rangle$  $\langle\sigma_{95}\rangle$  $\langle|\Delta_{z'}| > 0.15\rangle$
SOMz               Inlier & Outlier                 0.0324                 0.077                        0.289                        15.06 %
                   Inlier only                      0.003                  0.032                        0.122                        3.35 %
Adaboost           Inlier & Outlier                 0.0046                 0.027                        0.167                        4.12 %
                   Inlier only                      0.0014                 0.024                        0.111                        2.98 %
Median regression  Inlier & Outlier                 0.0092                 0.087                        0.176                        11.22 %
                   Inlier only                      0.0034                 0.085                        0.172                        8.87 %
Mean regression    Inlier & Outlier                 0.0854                 0.086                        0.198                        29.18 %
                   Inlier only                      0.0039                 0.078                        0.16                         6.75 %
SOMz               (no pre-cont.) Inlier & Outlier  0.0009                 0.03                         0.134                        3.27 %
                   (no pre-cont.) Inlier only       0.0003                 0.029                        0.119                        3.15 %
Adaboost           (no pre-cont.) Inlier & Outlier  0.0002                 0.023                        0.1                          2.58 %
                   (no pre-cont.) Inlier only       0.0003                 0.023                        0.11                         2.98 %

Table 1. The absolute values of the different machine learning architectures applied to both data contaminated with unreliable redshifts
and the data sample with outliers removed using the Elliptical Envelope technique. The final two rows show the results applied to the
data sample without initial contamination. The measured statistics are shown in the column headings, and are measured on the redshift
scaled residual distribution $\Delta_{z'}$. The quoted values are the median values at fixed $n_c = 0.07$ of the 250 samples that each have a different
inserted contamination fraction (for the top 4 rows only). The bottom two rows use data that is not initially contaminated, although it
is also cleaned using the Elliptical Envelope technique, and are highlighted by ‘(no pre-cont.)’.


We find improvement in all of the explored metrics
when training on the cleaned sample compared with
training on the contaminated sample, when comparing each
machine learning method with respect to itself. We also
compare the results across machine learning architectures, and
find the best redshift estimation results are found using
Decision Trees boosted using the AdaBoost routine (Freund
& Schapire 1997; Drucker 1997). This result has been seen
before by the authors (Hoyle et al. 2015), however in that
work the results are coupled with the enhanced ability of
tree methods to use many tens, or hundreds, of input feature
dimensions.

The SDSS data used in this work represents an optimal
dataset because it covers a similar wavelength range in the
photometry and spectrometry. Many other surveys do not
have this luxury. For example the Dark Energy Survey (The
Dark Energy Survey Collaboration 2005) has g, r, i, z, Y
photometry with varying depth across the sky, and has spectra
drawn from heterogeneous sources. Performing outlier
detection with a heterogeneous spectroscopic sample would
still be possible as long as the photometry were not varying
in depth drastically, otherwise even reliable data could be
flagged as anomalous. If this is not the case, one potential
avenue could be to degrade the entire photometry, or large
fractions of it, to a similar depth and again perform outlier
detection as described in this work. Furthermore we note
that the spectroscopic quality flags for the SDSS data are
a good estimator of reliability. This is not always true for
other datasets and spectroscopic surveys, e.g., the PRIMUS
dataset appears to have unreliable redshift estimates for
some of the most secure redshifts provided by their quality
flags (Coil et al. 2011; Cool et al. 2013; see Bonnet et al.
in prep.). However one should still perform anomaly detection,
even with a less reliable data sample, or one may be
learning trends from spurious data.

In this work we have also assumed that the final test
sample is not contaminated by data with unreliable
spectroscopic redshifts. If such a sample could not be
constructed, this would not necessarily remove the usefulness
of the techniques presented in this paper. This is because a
contaminated test sample would provide a similar detrimental
effect to any training sample and so they would be penalised
equally. This is unless the pathological case exists in which
galaxies with very similar photometry, and also similar
unreliable redshift values, inhabit both the training and test
samples.

An interesting avenue of future research would be to
perform outlier detection on a data sample to remove
anomalous training data. This may reduce the feature
parameter space such that the training sample is no longer
representative of the test sample. One may then employ
methods from data augmentation (see Hoyle et al. 2015), which
enhance the training sample using third party data, from
models, simulations or other datasets, to make the training
sample again representative of the test sample. This would
work if the augmented data sample spans a similar input
feature space (i.e. has the same measured photometric
properties) as the training and test samples.

As with all machine learning works, the results found 
here should be applied cautiously to new datasets. Similar 
analysis to that described here should be performed to check 
if there is indeed a problem with contaminating data. If so, 
then we have shown that the removal of this contaminating 
data can greatly improve the machine learning redshift point 
estimates. 


A CASJOBS MYSQL QUERY 

We obtain observational data from the SDSS using the
following MySQL query which is run in the DR12 schema:

select s.specObjID, q.objid, q.ra,q.dec, 
s.z as specz, s.zerr as specz_err, 
q.dered_u,q.dered_g,q.dered_r,q.dered_i,q.dered_z, 
q.modelMagErr_u,q.modelMagErr_g,q.modelMagErr_r, 
q.modelMagErr_i,q.modelMagErr_z, 
q.petroRad_r,q.petroRadErr_r, 

s.sourceType as specType, q.type as photpType, 
s.zWarning 

into mydb.specPhotoDR12v2 from SpecObjAll as s 
join photoObjAll as q on s.bestobjid=q.objid 
and q.dered_g>0 and q.dered_r>0 
and q.dered_z>0 and q.type=3 